rm(list = ls())
knitr::opts_chunk$set(message = FALSE, warning = FALSE)
library(tidyverse)
library(funModeling)
library(caret)
library(VIM)
library(mice)
library(ggcorrplot)
library(plotly)
library(pROC)
library(lubridate)
library(glmnet)
library(broom)
library(MASS)
library(usmap)
library(RColorBrewer)
set.seed(1)
set_dir <- '/Users/shuheimiyasaka/Google Drive/BST 260 Final Project/For Submission'

Motivation

As graduate students, debt is all around us. Fortunately, we are also data scientists, and data is all around us, too. Most of the major milestones in life - such as graduating from college or a graduate program, or buying a house - require taking on debt. A report from Time’s Money Magazine based on the Federal Reserve’s Survey of Consumer Finances found that on average, Americans under 35 owe $67,400. For middle-aged Americans, the average is even higher, ranging from $108,300 for 55-64 year olds to $134,600 for 45-54 year olds (http://time.com/money/5233033/average-debt-every-age/).

We are interested in digging deeper into how loan applications are considered, and better understanding factors which might be considered by lending agents when applying for a loan, from the applicant’s perspective. Alternatively, our analysis could be useful from the lender’s perspective in identifying factors which predict loan defaults.

The goal for our final project was to build a prediction model using the LendingClub loans data set. We wanted to build a model that could predict a dichotomized outcome of loan status (which will be explained in more detail below) using the variables given in the data set. In this page, we will walk you through our process for building our final prediction model.

Data Exploration

Our data is available via kaggle at https://www.kaggle.com/wendykan/lending-club-loan-data. This data is unique from most other financial institution’s data because of the lending method used by ‘LendingClub’. Headquartered in San Francisco, LendingClub connects borrowers applying for personal loans, auto refinancing, business loans, and elective medical procedures with investors. LendingClub reports that it is America’s largest online marketplace, and emphasizes the digital aspects of its model (https://www.lendingclub.com/). LendingClub also services the loans, and therefore maintains data on the loans’ statuses, as well as information about the loan application.

Data Cleaning

setwd(set_dir)
load('./loan.Rdata')
#loan.dat <- read.csv('loan.csv', header = TRUE)
#save(loan.dat, file = "./loan.RData")

The data set has 887,379 records with 74 variables.

dim(loan.dat)
## [1] 887379     74
names(loan.dat)
##  [1] "id"                          "member_id"                  
##  [3] "loan_amnt"                   "funded_amnt"                
##  [5] "funded_amnt_inv"             "term"                       
##  [7] "int_rate"                    "installment"                
##  [9] "grade"                       "sub_grade"                  
## [11] "emp_title"                   "emp_length"                 
## [13] "home_ownership"              "annual_inc"                 
## [15] "verification_status"         "issue_d"                    
## [17] "loan_status"                 "pymnt_plan"                 
## [19] "url"                         "desc"                       
## [21] "purpose"                     "title"                      
## [23] "zip_code"                    "addr_state"                 
## [25] "dti"                         "delinq_2yrs"                
## [27] "earliest_cr_line"            "inq_last_6mths"             
## [29] "mths_since_last_delinq"      "mths_since_last_record"     
## [31] "open_acc"                    "pub_rec"                    
## [33] "revol_bal"                   "revol_util"                 
## [35] "total_acc"                   "initial_list_status"        
## [37] "out_prncp"                   "out_prncp_inv"              
## [39] "total_pymnt"                 "total_pymnt_inv"            
## [41] "total_rec_prncp"             "total_rec_int"              
## [43] "total_rec_late_fee"          "recoveries"                 
## [45] "collection_recovery_fee"     "last_pymnt_d"               
## [47] "last_pymnt_amnt"             "next_pymnt_d"               
## [49] "last_credit_pull_d"          "collections_12_mths_ex_med" 
## [51] "mths_since_last_major_derog" "policy_code"                
## [53] "application_type"            "annual_inc_joint"           
## [55] "dti_joint"                   "verification_status_joint"  
## [57] "acc_now_delinq"              "tot_coll_amt"               
## [59] "tot_cur_bal"                 "open_acc_6m"                
## [61] "open_il_6m"                  "open_il_12m"                
## [63] "open_il_24m"                 "mths_since_rcnt_il"         
## [65] "total_bal_il"                "il_util"                    
## [67] "open_rv_12m"                 "open_rv_24m"                
## [69] "max_bal_bc"                  "all_util"                   
## [71] "total_rev_hi_lim"            "inq_fi"                     
## [73] "total_cu_tl"                 "inq_last_12m"
meta_loans <- funModeling::df_status(loan.dat, print_results = FALSE)
meta_loans[order(-meta_loans$p_na),]

As part of data exploration, we examined with percentage of “zeros”, missing records, and unique values in the data set per variable as shown above. From the table above, we notice a number of variables with significant amount of missing data.

Based on examining the data set and reading the data dictionary, we decided to immediately rule out the following variables from our model: id, member_id, url, and desc.

cols.2.remove <- c('id', 'member_id', 'url', 'desc')

We decided to exclude variables with more than 10% missing data (19 variables).

missing.data.col <- meta_loans$variable[meta_loans$p_na > 10.]
missing.data.col
##  [1] "mths_since_last_delinq"      "mths_since_last_record"     
##  [3] "mths_since_last_major_derog" "annual_inc_joint"           
##  [5] "dti_joint"                   "open_acc_6m"                
##  [7] "open_il_6m"                  "open_il_12m"                
##  [9] "open_il_24m"                 "mths_since_rcnt_il"         
## [11] "total_bal_il"                "il_util"                    
## [13] "open_rv_12m"                 "open_rv_24m"                
## [15] "max_bal_bc"                  "all_util"                   
## [17] "inq_fi"                      "total_cu_tl"                
## [19] "inq_last_12m"
length(missing.data.col)
## [1] 19
cols.2.remove <- c(cols.2.remove, missing.data.col)
meta_loans[order(meta_loans$unique),]
cols.2.remove <- c(cols.2.remove, 'policy_code')

We also decided to remove policy_code since it only has one unique value.

At this point, we had 50 potential covariates:

cols.2.keep <- !(colnames(loan.dat) %in% cols.2.remove)
colnames(loan.dat)[cols.2.keep]
##  [1] "loan_amnt"                  "funded_amnt"               
##  [3] "funded_amnt_inv"            "term"                      
##  [5] "int_rate"                   "installment"               
##  [7] "grade"                      "sub_grade"                 
##  [9] "emp_title"                  "emp_length"                
## [11] "home_ownership"             "annual_inc"                
## [13] "verification_status"        "issue_d"                   
## [15] "loan_status"                "pymnt_plan"                
## [17] "purpose"                    "title"                     
## [19] "zip_code"                   "addr_state"                
## [21] "dti"                        "delinq_2yrs"               
## [23] "earliest_cr_line"           "inq_last_6mths"            
## [25] "open_acc"                   "pub_rec"                   
## [27] "revol_bal"                  "revol_util"                
## [29] "total_acc"                  "initial_list_status"       
## [31] "out_prncp"                  "out_prncp_inv"             
## [33] "total_pymnt"                "total_pymnt_inv"           
## [35] "total_rec_prncp"            "total_rec_int"             
## [37] "total_rec_late_fee"         "recoveries"                
## [39] "collection_recovery_fee"    "last_pymnt_d"              
## [41] "last_pymnt_amnt"            "next_pymnt_d"              
## [43] "last_credit_pull_d"         "collections_12_mths_ex_med"
## [45] "application_type"           "verification_status_joint" 
## [47] "acc_now_delinq"             "tot_coll_amt"              
## [49] "tot_cur_bal"                "total_rev_hi_lim"
length(colnames(loan.dat)[cols.2.keep])
## [1] 50
loan.dat <- loan.dat[, cols.2.keep]

We also decided to remove 6 records with missing or zero annual income since we felt this information was a requirement for obtaining a loan and a covariate that we must definitely include in our final model (and didn’t feel we could impute these values properly)!

query = loan.dat$annual_inc == 0.
query.na = is.na(query)
if (sum(query.na) > 0){
  query[query.na] = TRUE
}
if (sum(query) > 0){
  loan.dat = loan.dat[!query,]
} else stop('unexpected case')

With the remaining set of records and covariates, we decided to examine the pairwise correlation of covariates:

meta_loans <- funModeling::df_status(loan.dat, print_results = FALSE)
numeric_cols <- meta_loans$variable[meta_loans$type == 'numeric']

cor.dat <- cor(loan.dat[,numeric_cols], loan.dat[,numeric_cols])
plot_ly(x=colnames(cor.dat), 
        y=rownames(cor.dat), 
        z = cor.dat, type = "heatmap", colorscale="Greys")
#ggcorrplot(cor(loan.dat[,numeric_cols]))
#aggr(loan.dat, combined=T, cex.axis=0.6)

We notice from the plot above that there are a few covariates that are highly correlated (which is not unexpected).

We also calculated basic summary statistics of our covariates to help us better understand the data:

summary(loan.dat)
##    loan_amnt      funded_amnt    funded_amnt_inv         term       
##  Min.   :  500   Min.   :  500   Min.   :    0    36 months:621119  
##  1st Qu.: 8000   1st Qu.: 8000   1st Qu.: 8000    60 months:266254  
##  Median :13000   Median :13000   Median :13000                      
##  Mean   :14755   Mean   :14742   Mean   :14703                      
##  3rd Qu.:20000   3rd Qu.:20000   3rd Qu.:20000                      
##  Max.   :35000   Max.   :35000   Max.   :35000                      
##                                                                     
##     int_rate      installment      grade        sub_grade     
##  Min.   : 5.32   Min.   :  15.67   A:148198   B3     : 56323  
##  1st Qu.: 9.99   1st Qu.: 260.71   B:254535   B4     : 55626  
##  Median :12.99   Median : 382.55   C:245859   C1     : 53387  
##  Mean   :13.25   Mean   : 436.72   D:139541   C2     : 52235  
##  3rd Qu.:16.20   3rd Qu.: 572.60   E: 70705   C3     : 50161  
##  Max.   :28.99   Max.   :1445.46   F: 23046   C4     : 48857  
##                                    G:  5489   (Other):570784  
##             emp_title          emp_length      home_ownership  
##                  : 51451   10+ years:291569   ANY     :     3  
##  Teacher         : 13469   2 years  : 78870   MORTGAGE:443555  
##  Manager         : 11240   < 1 year : 70601   NONE    :    46  
##  Registered Nurse:  5525   3 years  : 70026   OTHER   :   182  
##  Owner           :  5376   1 year   : 57095   OWN     : 87470  
##  RN              :  5355   5 years  : 55704   RENT    :356117  
##  (Other)         :794957   (Other)  :263508                    
##    annual_inc           verification_status     issue_d      
##  Min.   :   1200   Not Verified   :266744   Oct-2015: 48631  
##  1st Qu.:  45000   Source Verified:329558   Jul-2015: 45962  
##  Median :  65000   Verified       :291071   Dec-2015: 44341  
##  Mean   :  75028                            Oct-2014: 38782  
##  3rd Qu.:  90000                            Nov-2015: 37529  
##  Max.   :9500000                            Aug-2015: 35886  
##                                             (Other) :636242  
##              loan_status     pymnt_plan               purpose      
##  Current           :601777   n:887363   debt_consolidation:524214  
##  Fully Paid        :207723   y:    10   credit_card       :206181  
##  Charged Off       : 45248              home_improvement  : 51829  
##  Late (31-120 days): 11591              other             : 42890  
##  Issued            :  8460              major_purchase    : 17277  
##  In Grace Period   :  6253              small_business    : 10377  
##  (Other)           :  6321              (Other)           : 34605  
##                      title           zip_code        addr_state    
##  Debt consolidation     :414000   945xx  :  9770   CA     :129517  
##  Credit card refinancing:164330   750xx  :  9417   NY     : 74082  
##  Home improvement       : 40112   112xx  :  9272   TX     : 71136  
##  Other                  : 31892   606xx  :  8641   FL     : 60935  
##  Debt Consolidation     : 15760   300xx  :  8126   IL     : 35476  
##  Major purchase         : 12051   100xx  :  7605   NJ     : 33256  
##  (Other)                :209228   (Other):834542   (Other):482971  
##       dti           delinq_2yrs      earliest_cr_line  inq_last_6mths   
##  Min.   :   0.00   Min.   : 0.0000   Aug-2001:  6659   Min.   : 0.0000  
##  1st Qu.:  11.91   1st Qu.: 0.0000   Aug-2000:  6529   1st Qu.: 0.0000  
##  Median :  17.65   Median : 0.0000   Oct-2000:  6322   Median : 0.0000  
##  Mean   :  18.13   Mean   : 0.3144   Oct-2001:  6154   Mean   : 0.6946  
##  3rd Qu.:  23.95   3rd Qu.: 0.0000   Aug-2002:  6086   3rd Qu.: 1.0000  
##  Max.   :1092.52   Max.   :39.0000   Sep-2000:  5918   Max.   :33.0000  
##                    NA's   :25        (Other) :849705   NA's   :25       
##     open_acc        pub_rec          revol_bal         revol_util    
##  Min.   : 0.00   Min.   : 0.0000   Min.   :      0   Min.   :  0.00  
##  1st Qu.: 8.00   1st Qu.: 0.0000   1st Qu.:   6443   1st Qu.: 37.70  
##  Median :11.00   Median : 0.0000   Median :  11875   Median : 56.00  
##  Mean   :11.55   Mean   : 0.1953   Mean   :  16921   Mean   : 55.07  
##  3rd Qu.:14.00   3rd Qu.: 0.0000   3rd Qu.:  20829   3rd Qu.: 73.60  
##  Max.   :90.00   Max.   :86.0000   Max.   :2904836   Max.   :892.30  
##  NA's   :25      NA's   :25                          NA's   :498     
##    total_acc      initial_list_status   out_prncp     out_prncp_inv  
##  Min.   :  1.00   f:456843            Min.   :    0   Min.   :    0  
##  1st Qu.: 17.00   w:430530            1st Qu.:    0   1st Qu.:    0  
##  Median : 24.00                       Median : 6459   Median : 6456  
##  Mean   : 25.27                       Mean   : 8403   Mean   : 8400  
##  3rd Qu.: 32.00                       3rd Qu.:13659   3rd Qu.:13654  
##  Max.   :169.00                       Max.   :49373   Max.   :49373  
##  NA's   :25                                                          
##   total_pymnt    total_pymnt_inv total_rec_prncp total_rec_int    
##  Min.   :    0   Min.   :    0   Min.   :    0   Min.   :    0.0  
##  1st Qu.: 1915   1st Qu.: 1900   1st Qu.: 1201   1st Qu.:  441.5  
##  Median : 4895   Median : 4862   Median : 3215   Median : 1073.3  
##  Mean   : 7559   Mean   : 7521   Mean   : 5758   Mean   : 1754.8  
##  3rd Qu.:10617   3rd Qu.:10566   3rd Qu.: 8000   3rd Qu.: 2238.3  
##  Max.   :57778   Max.   :57778   Max.   :35000   Max.   :24205.6  
##                                                                   
##  total_rec_late_fee   recoveries       collection_recovery_fee
##  Min.   :  0.0000   Min.   :    0.00   Min.   :   0.000       
##  1st Qu.:  0.0000   1st Qu.:    0.00   1st Qu.:   0.000       
##  Median :  0.0000   Median :    0.00   Median :   0.000       
##  Mean   :  0.3967   Mean   :   45.92   Mean   :   4.881       
##  3rd Qu.:  0.0000   3rd Qu.:    0.00   3rd Qu.:   0.000       
##  Max.   :358.6800   Max.   :33520.27   Max.   :7002.190       
##                                                               
##    last_pymnt_d    last_pymnt_amnt     next_pymnt_d    last_credit_pull_d
##  Jan-2016:470148   Min.   :    0.0   Feb-2016:553404   Jan-2016:730572   
##  Dec-2015:150861   1st Qu.:  280.2           :252971   Dec-2015: 19308   
##          : 17659   Median :  462.8   Jan-2016: 78195   Nov-2015: 11490   
##  Oct-2015: 16000   Mean   : 2164.2   Mar-2011:   107   Oct-2015: 10419   
##  Jul-2015: 14483   3rd Qu.:  831.2   Apr-2011:   101   Sep-2015: 10087   
##  Nov-2015: 13981   Max.   :36475.6   Feb-2011:    91   Jul-2015:  8642   
##  (Other) :204241                     (Other) :  2504   (Other) : 96855   
##  collections_12_mths_ex_med   application_type 
##  Min.   : 0.00000           INDIVIDUAL:886864  
##  1st Qu.: 0.00000           JOINT     :   509  
##  Median : 0.00000                              
##  Mean   : 0.01438                              
##  3rd Qu.: 0.00000                              
##  Max.   :20.00000                              
##  NA's   :141                                   
##    verification_status_joint acc_now_delinq       tot_coll_amt    
##                 :886864      Min.   : 0.000000   Min.   :      0  
##  Not Verified   :   281      1st Qu.: 0.000000   1st Qu.:      0  
##  Source Verified:    61      Median : 0.000000   Median :      0  
##  Verified       :   167      Mean   : 0.004991   Mean   :    226  
##                              3rd Qu.: 0.000000   3rd Qu.:      0  
##                              Max.   :14.000000   Max.   :9152545  
##                              NA's   :25          NA's   :70272    
##   tot_cur_bal      total_rev_hi_lim 
##  Min.   :      0   Min.   :      0  
##  1st Qu.:  29853   1st Qu.:  13900  
##  Median :  80559   Median :  23700  
##  Mean   : 139458   Mean   :  32069  
##  3rd Qu.: 208205   3rd Qu.:  39800  
##  Max.   :8000078   Max.   :9999999  
##  NA's   :70272     NA's   :70272

Preliminary Analysis, Visualization, and Feature Engineering

After conducting a preliminary data exploration, we got a much better sense of our dataset. Based on what we learned, we dropped further covariates and re-categorized some of the variables which we will explain in this section.

In our prediction model, we decided to predict loan status. We dichotomized loan status into “Good” and “Bad” based on the following criteria: Good
1. Fully Paid
2. Current

Bad
1. Default
2. Charged Off
3. Late (16-30 days)
4. Late (31-120 days)

Since we are predicting loan status using our model, we decided to drop records with loan_status with values Issued:

query <- loan.dat$loan_status != 'Issued'
loan.dat <- loan.dat[query, ]

loan.dat$loan_status_bin <- NA
query = loan.dat$loan_status == 'Fully Paid' | loan.dat$loan_status == 'Current'
loan.dat$loan_status_bin[query] = 'Good'

query = loan.dat$loan_status == 'Charged Off' | loan.dat$loan_status == 'Default' |
  loan.dat$loan_status == 'Late (16-30 days)' | loan.dat$loan_status == 'Late (31-120 days)'
loan.dat$loan_status_bin[query] = 'Bad'

query <- !is.na(loan.dat$loan_status_bin)
loan.dat <- loan.dat[query, ]

loan.dat$loan_status_bin = as.factor(loan.dat$loan_status_bin)
summary(loan.dat$loan_status_bin)
##    Bad   Good 
##  60415 809500

We also converted funded_amnt_inv to be percent funded amount by investors perc_funded_amnt_inv and only use the year from issue date (issue_d):

loan.dat <- loan.dat %>%
  mutate(perc_funded_amnt_inv = funded_amnt_inv/funded_amnt,
         issue_d = as.character(issue_d),
         term = as.character(term)) %>%
  mutate(year = as.numeric(str_sub(issue_d, start = -4)))

We reclassified a 36 month loan to “Short” and a 60 month loan to “Long”.

query <- loan.dat$term == ' 36 months'
loan.dat$term[query] = 'Short'
loan.dat$term[!query] = 'Long'
loan.dat$term = as.factor(loan.dat$term)

We also dichotomized tot_coll_amt to 0 and greater than 0 (and replaced the missing values to zero):

query.na <- is.na(loan.dat$tot_coll_amt)
if (sum(query.na) >0){
  loan.dat$tot_coll_amt[query.na] = 0
}
loan.dat <- loan.dat %>%
  mutate(tot_coll_amt_gt0 = as.factor(tot_coll_amt > 0.))

We recoded the grade variable to be ordinal:

loan.dat$grade_ordinal <- NA
loan.dat <- loan.dat %>% 
  mutate(grade = as.character(grade))
grades <- c('A', 'B', 'C', 'D', 'E', 'F', 'G')

counter = 1
for (grade in grades){
  
  query <- loan.dat$grade == grade
  if (sum(query) > 0){
    loan.dat$grade_ordinal[query] = as.numeric(counter)
  } 
  counter = counter + 1
}

sum(is.na(loan.dat$grade_ordinal))
## [1] 0

We also recoded the emp_length variable to be ordinal (and removed records with no employment length information):

loan.dat$emp_length_ordinal <- NA
loan.dat <- loan.dat %>% 
  mutate(emp_length = as.character(emp_length))

loan.dat <- loan.dat %>% filter(!(emp_length == 'n/a'))

emp_lengths <- c('< 1 year', '1 year',
                 '2 years', '3 years',
                 '4 years', '5 years',
                 '6 years', '7 years',
                 '8 years', '9 years',
                 '10+ years')

counter = 1
for (emp_length in emp_lengths){
  
  query <- loan.dat$emp_length == emp_length
  if (sum(query) > 0){
    loan.dat$emp_length_ordinal[query] = as.numeric(counter)
  } 
  counter = counter + 1
}

sum(is.na(loan.dat$emp_length_ordinal))
## [1] 0

Based on examining the uni-variate plots and the correlation plot, we decided to keep the following 27 covariates:
### we can probably add a more detailed reason for dropping certain columns as outline in google docs

predictors <- c('loan_amnt', 'funded_amnt',
                'grade_ordinal',
                'emp_length_ordinal', 'home_ownership',
                'annual_inc', 'verification_status',
                'purpose',
                'addr_state', 'dti',
                'delinq_2yrs', 'inq_last_6mths',
                'open_acc', 'pub_rec', 
                'revol_bal', 'revol_util',
                'total_acc', 'initial_list_status',
                'application_type', 'acc_now_delinq',
                'tot_coll_amt_gt0', 'tot_cur_bal',
                'total_rev_hi_lim', 'perc_funded_amnt_inv',
                'term')
predictors
##  [1] "loan_amnt"            "funded_amnt"          "grade_ordinal"       
##  [4] "emp_length_ordinal"   "home_ownership"       "annual_inc"          
##  [7] "verification_status"  "purpose"              "addr_state"          
## [10] "dti"                  "delinq_2yrs"          "inq_last_6mths"      
## [13] "open_acc"             "pub_rec"              "revol_bal"           
## [16] "revol_util"           "total_acc"            "initial_list_status" 
## [19] "application_type"     "acc_now_delinq"       "tot_coll_amt_gt0"    
## [22] "tot_cur_bal"          "total_rev_hi_lim"     "perc_funded_amnt_inv"
## [25] "term"
outcome <- c('loan_status_bin')

loan.dat <- loan.dat[, c(outcome, predictors)]

We also did a simple uni-variate analysis of the covariates using histograms and boxplots:

meta_loans <- funModeling::df_status(loan.dat, print_results = FALSE)
numeric_cols <- meta_loans$variable[meta_loans$type == 'numeric']

for (col_name in numeric_cols){
  plt <- loan.dat %>% ggplot(aes(loan.dat[, col_name])) +
    geom_histogram(color = "black") + 
    ggtitle(col_name) + labs(x=col_name) 
  print(plt)
}

for (col_name in numeric_cols){
  plt <- loan.dat %>% ggplot(aes(x=loan_status_bin, loan.dat[, col_name])) +
    geom_boxplot() + 
    ggtitle(col_name) + labs(x='Loan Status', y=col_name)
  print(plt)
}

At this point, we still had a few records with missing data. We initially tried imputing these values using the mice package but due to computational reasons, we decided to drop this idea.

meta_loans <- funModeling::df_status(loan.dat, print_results = FALSE)
meta_loans[order(-meta_loans$p_na),]
# loan.dat = mice(loan.dat, m=1)  # let's just impute one dataset
# loan.dat = complete(loan.dat, 1)

Instead, we decided to set the missing values to the mean using the rest of the data:

sum(is.na(loan.dat$loan_status_bin))
## [1] 0
for (j in 1:ncol(loan.dat)){
  miss = is.na(loan.dat[,j])
  if (sum(miss) > 0){
    loan.dat[miss, j] = mean(loan.dat[,j], na.rm=T)
  }
}
sum(is.na(loan.dat))
## [1] 0

Here is a final basic summary statistics of the remaining covariates:

##  loan_status_bin   loan_amnt      funded_amnt    grade_ordinal  
##  Bad : 56958     Min.   :  500   Min.   :  500   Min.   :1.000  
##  Good:769037     1st Qu.: 8375   1st Qu.: 8325   1st Qu.:2.000  
##                  Median :13225   Median :13200   Median :3.000  
##                  Mean   :14918   Mean   :14904   Mean   :2.789  
##                  3rd Qu.:20000   3rd Qu.:20000   3rd Qu.:4.000  
##                  Max.   :35000   Max.   :35000   Max.   :7.000  
##                                                                 
##  emp_length_ordinal  home_ownership     annual_inc     
##  Min.   : 1.000     ANY     :     3   Min.   :   3800  
##  1st Qu.: 4.000     MORTGAGE:414900   1st Qu.:  47000  
##  Median : 7.000     NONE    :    44   Median :  65000  
##  Mean   : 7.016     OTHER   :   141   Mean   :  76323  
##  3rd Qu.:11.000     OWN     : 77211   3rd Qu.:  90000  
##  Max.   :11.000     RENT    :333696   Max.   :9500000  
##                                                        
##       verification_status               purpose         addr_state    
##  Not Verified   :256052   debt_consolidation:490336   CA     :121460  
##  Source Verified:315729   credit_card       :191638   NY     : 68763  
##  Verified       :254214   home_improvement  : 47480   TX     : 66997  
##                           other             : 38973   FL     : 55810  
##                           major_purchase    : 15973   IL     : 33213  
##                           small_business    :  9765   NJ     : 31263  
##                           (Other)           : 31830   (Other):448489  
##       dti          delinq_2yrs      inq_last_6mths      open_acc    
##  Min.   :  0.00   Min.   : 0.0000   Min.   :0.0000   Min.   : 0.00  
##  1st Qu.: 11.87   1st Qu.: 0.0000   1st Qu.:0.0000   1st Qu.: 8.00  
##  Median : 17.57   Median : 0.0000   Median :0.0000   Median :11.00  
##  Mean   : 18.05   Mean   : 0.3161   Mean   :0.6848   Mean   :11.61  
##  3rd Qu.: 23.81   3rd Qu.: 0.0000   3rd Qu.:1.0000   3rd Qu.:14.00  
##  Max.   :380.53   Max.   :39.0000   Max.   :8.0000   Max.   :90.00  
##                                                                     
##     pub_rec          revol_bal         revol_util       total_acc     
##  Min.   : 0.0000   Min.   :      0   Min.   :  0.00   Min.   :  2.00  
##  1st Qu.: 0.0000   1st Qu.:   6544   1st Qu.: 37.90   1st Qu.: 17.00  
##  Median : 0.0000   Median :  12018   Median : 56.10   Median : 24.00  
##  Mean   : 0.1885   Mean   :  17047   Mean   : 55.23   Mean   : 25.34  
##  3rd Qu.: 0.0000   3rd Qu.:  21020   3rd Qu.: 73.70   3rd Qu.: 32.00  
##  Max.   :86.0000   Max.   :2904836   Max.   :892.30   Max.   :169.00  
##                                                                       
##  initial_list_status   application_type  acc_now_delinq     
##  f:427301            INDIVIDUAL:825607   Min.   : 0.000000  
##  w:398694            JOINT     :   388   1st Qu.: 0.000000  
##                                          Median : 0.000000  
##                                          Mean   : 0.005021  
##                                          3rd Qu.: 0.000000  
##                                          Max.   :14.000000  
##                                                             
##  tot_coll_amt_gt0  tot_cur_bal      total_rev_hi_lim  perc_funded_amnt_inv
##  FALSE:718547     Min.   :      0   Min.   :      0   Min.   :0.0000      
##  TRUE :107448     1st Qu.:  33375   1st Qu.:  14878   1st Qu.:1.0000      
##                   Median : 103842   Median :  26000   Median :1.0000      
##                   Mean   : 142008   Mean   :  32252   Mean   :0.9972      
##                   3rd Qu.: 199264   3rd Qu.:  38100   3rd Qu.:1.0000      
##                   Max.   :8000078   Max.   :9999999   Max.   :1.0000      
##                                                                           
##     term       
##  Long :252704  
##  Short:573291  
##                
##                
##                
##                
## 

We examined a few associations that we thought would likely be meaningful visually. For example, we would want to include “State” in our model if it were likely to be meaningful in terms of loan status outcome, but with 51 levels, it would add a lot of parameters to our model if not necessary and impact our inference.

state.dat  <- loan.dat %>%
  group_by(addr_state) %>%
  summarise(state.prop.default =  mean(loan_status_bin == "Bad", na.rm = TRUE),
            state.prop.gradeA = sum(grade_ordinal == 1)/n(),
            state.prop.gradeB = sum(grade_ordinal == 2)/n(),
            state.prop.gradeC = sum(grade_ordinal == 3)/n(),
            state.prop.gradeD = sum(grade_ordinal == 4)/n(),
            state.prop.gradeE = sum(grade_ordinal == 5)/n(),
            state.prop.gradeF = sum(grade_ordinal == 6)/n(),
            state.prop.gradeG = sum(grade_ordinal == 7)/n()) %>%
  mutate(state = addr_state)

#head(state.dat)

plot_usmap(data = state.dat, regions = "state", values = "state.prop.default") +
  scale_fill_continuous(name = "") + 
  ggtitle("Proportion of Loan Defaults by State") + 
  theme(legend.position = "right")

We see some variation from state to state, but did not observe regional trends which would allow us to collapse our states into a lower number of categories.

We also hypothesized that Debt-to-Income Ratio (DTI) might affect loan repayment, and examined the distribution of this variable.

loan.dat %>%
  filter(!is.na(loan_status_bin)) %>%
  ggplot() + 
  geom_boxplot(aes(y = dti, x = loan_status_bin,color = grade_ordinal)) +
  scale_color_brewer(name = "LendingClub Grade",palette = "Blues") +
  theme_dark() +
  scale_y_continuous(name = "Debt-to-Income Ratio") +
  scale_x_discrete(name = "Grade") +
  ggtitle("Distributions of DTI by LendingClub Loan Grade and Loan Status")

#excluding 2 outliers
loan.dat %>%
  filter(dti < 7500 & !is.na(loan_status_bin)) %>%
  ggplot() + 
  geom_boxplot(aes(y = dti, x = loan_status_bin,color = grade)) +
  scale_color_brewer(name = "LendingClub Grade",palette = "Blues") +
  theme_dark() +
  scale_y_continuous(name = "Debt-to-Income Ratio") +
  scale_x_discrete(name = "Grade") +
  ggtitle("Distributions of DTI by LendingClub Loan Grade and Loan Status")

#sqrt transform
loan.dat %>%
  filter(!is.na(loan_status_bin)) %>%
  ggplot() + 
  geom_boxplot(aes(y = sqrt(dti), x = loan_status_bin,color = grade_ordinal)) +
  scale_color_brewer(name = "LendingClub Grade",palette = "Blues") +
  theme_dark() +
  scale_y_continuous(name = "Debt-to-Income Ratio") +
  scale_x_discrete(name = "Grade") +
  ggtitle("Distributions of DTI by LendingClub Loan Grade and Loan Status")

#overall distribution
loan.dat %>%
  filter(!is.na(loan_status_bin)) %>%
  ggplot() + 
  geom_boxplot(aes(y = sqrt(dti), x = loan_status_bin,color = loan_status_bin)) +
  scale_color_brewer(name = "Loan Status",palette = "Blues") +
  theme_dark() +
  scale_y_continuous(name = "Debt-to-Income Ratio") +
  scale_x_discrete(name = "Loan Status") +
  ggtitle("Distributions of DTI by Loan Status")

Building the Prediction Model

In building our prediction model, we considered three different models*:
1. Ridge
2. Lasso
3. Elastic Net

Among the three options, we picked our final model based on the best test performance. We were concerned by the large imbalance in classes in our data set. We only had 7.4% of “Bad” laon status. Therefore, we decided to evalute our model based on the AUC performance because this metric is less affected by the imbalance in the classes. We kept about a fifth of a data (200,000 records) for testing and used the rest for training our models. We trained our candidate models using 5-fold cross-validation (on the training set) to obtain the optimal tuning parameters and finally tested their performance on the test data set.

*Warning: Due to the large number of records and predictors, the models took a long time to train. The training time was approximately ~ 4 hours.

setwd(set_dir)
load('./loan.dat.Rdata')
start_time <- Sys.time()

summary(loan.dat$loan_status_bin)
56958/769037 # only 7.4% of bad status

# This chunk takes a very very long time to run!!!
# I ran this overnight on my laptop and 
# saved the results
# That's why eval is set to FALSE

# keep a fifth of the data set to assess test performance
set.seed(1)
test.indx <- sample(1:dim(loan.dat)[1], 200000, replace = FALSE)
train.indx <- setdiff(1:dim(loan.dat)[1], test.indx)

summary(loan.dat$loan_status_bin[test.indx])
summary(loan.dat$loan_status_bin[train.indx])

train.control = trainControl(method="repeatedcv", number=5, repeats=3, 
                    classProbs=T, summaryFunction=twoClassSummary)

models = c("ridge", "lasso", "enet")
n_models = length(models)

AUC = rep(0,n_models)
names(AUC) = models
for (m in 1:n_models) {
    
    print_str <- paste("Training model: ",
                       models[m], sep='')
    print(print_str)
    
    # save our results
    if (models[m] == 'ridge'){
      
      fit = train(loan_status_bin ~., 
                  data=loan.dat[train.indx, ], 
                  method="glmnet", metric="ROC", trControl=train.control,
                  tuneGrid=expand.grid(alpha = 0, lambda = .5 ^ (-20:20)))  
      
      fit.ridge <- fit
    } else if (models[m] == 'lasso'){
      
      fit = train(loan_status_bin ~., 
                  data=loan.dat[train.indx, ], 
                  method="glmnet", metric="ROC", trControl=train.control,
                  tuneGrid=expand.grid(alpha = 1, lambda = .5 ^ (-20:20)))      

      fit.lasso <- fit
    } else if (models[m] == 'enet'){
      
      fit = train(loan_status_bin ~., 
                  data=loan.dat[train.indx, ], 
                  method="glmnet", metric="ROC", trControl=train.control,
                  tuneGrid=expand.grid(alpha = seq(.05,.95,.05), lambda = .5 ^ (-20:20)))     

      fit.enet <- fit
    } else if (models[m] == 'adaboost'){

      fit = train(loan_status_bin ~., 
            data=loan.dat[train.indx, ], 
            method=models[m], metric="ROC", trControl=train.control)
  
      fit.adaboost <- fit
      
    } else if (models[m] == 'xgbTree'){

      fit = train(loan_status_bin ~., 
            data=loan.dat[train.indx, ], 
            method=models[m], metric="ROC", trControl=train.control)
  
      fit.xgb <- fit
      
    } else if (models[m] == 'C5.0Cost'){

      fit = train(loan_status_bin ~., 
            data=loan.dat[train.indx, ], 
            method=models[m], metric="ROC", trControl=train.control)
  
      fit.ccost <- fit
      
    } else if (models[m] == 'svmLinearWeights'){

      fit = train(loan_status_bin ~., 
            data=loan.dat[train.indx, ], 
            method=models[m], metric="ROC", trControl=train.control)
  
      fit.svm <- fit
      
    } else('Unknown model type!')
    
    probs = predict(fit, loan.dat[test.indx,], type="prob")
    
    R = roc(loan.dat$loan_status_bin[test.indx], probs$Good)
    
    plot.roc(R, add=(m>1), col=m, lwd=2, main="ROC curves")
    
    legend("bottomright", legend=models, col=1:n_models, lwd=2)

    AUC[m] = R$auc
}
AUC

end_time <- Sys.time()
time <- end_time - start_time
time

setwd(set_dir)
save(fit.ridge, fit.lasso, fit.enet, file = "./models.RData")
## [1] "ridge"

## [1] "lasso"
## [1] "enet"
##     ridge     lasso      enet 
## 0.7073157 0.7077265 0.7077178

The AUC for all three models are similar. However, lasso model (with \(\alpha=1\) and \(\lambda=0.0001220703\)) had the largest AUC. Therefore, we decided to use the elastic net model (with \(\alpha=1\) and \(\lambda=0.0001220703\)) as our final model and retrained the elastic net model with these parameters using all data.

x = model.matrix(loan_status_bin ~ ., loan.dat)[,-1]
y = loan.dat$loan_status_bin

final_mod = glmnet(x=x, y=y, family = 'binomial', alpha = 1, 
                   lambda = 0.0001220703)


setwd(set_dir)
save(final_mod, file = "./final_model.RData")

Results

Here is our final elastic net model (\(\alpha=1\) and \(\lambda=0.0001220703\)) with the following regression coefficients:

setwd(set_dir)
#load("final_model.RData")
coef(final_mod)
## 92 x 1 sparse Matrix of class "dgCMatrix"
##                                               s0
## (Intercept)                         2.126283e+00
## loan_amnt                          -6.750296e-06
## funded_amnt                         .           
## grade_ordinal                      -3.122226e-01
## emp_length_ordinal                  7.644084e-03
## home_ownershipMORTGAGE              2.317965e-03
## home_ownershipNONE                 -3.335335e-01
## home_ownershipOTHER                -5.775394e-01
## home_ownershipOWN                   .           
## home_ownershipRENT                 -1.271860e-01
## annual_inc                          4.555097e-06
## verification_statusSource Verified  7.753753e-02
## verification_statusVerified        -1.087390e-01
## purposecredit_card                  1.577752e-01
## purposedebt_consolidation           .           
## purposeeducational                 -4.108364e-01
## purposehome_improvement             .           
## purposehouse                       -2.484775e-02
## purposemajor_purchase               .           
## purposemedical                     -1.028531e-01
## purposemoving                      -2.943002e-02
## purposeother                       -3.470342e-02
## purposerenewable_energy            -1.101536e-01
## purposesmall_business              -5.684232e-01
## purposevacation                     2.839905e-02
## purposewedding                     -1.683520e-01
## addr_stateAL                       -1.177174e-01
## addr_stateAR                        .           
## addr_stateAZ                       -3.691371e-02
## addr_stateCA                       -9.251343e-02
## addr_stateCO                        1.610975e-01
## addr_stateCT                        8.365280e-02
## addr_stateDC                        3.818644e-01
## addr_stateDE                        .           
## addr_stateFL                       -1.531459e-01
## addr_stateGA                        1.720272e-02
## addr_stateHI                       -1.190482e-01
## addr_stateIA                       -8.190531e-02
## addr_stateID                        .           
## addr_stateIL                        1.824187e-01
## addr_stateIN                        3.678528e-02
## addr_stateKS                        1.539369e-01
## addr_stateKY                        1.526397e-02
## addr_stateLA                       -8.431524e-02
## addr_stateMA                       -4.081359e-02
## addr_stateMD                       -8.468537e-02
## addr_stateME                        2.471315e+00
## addr_stateMI                        .           
## addr_stateMN                       -4.660058e-03
## addr_stateMO                        .           
## addr_stateMS                        3.633531e-01
## addr_stateMT                        9.628141e-02
## addr_stateNC                       -7.785504e-02
## addr_stateND                        1.600109e+00
## addr_stateNE                        1.633703e+00
## addr_stateNH                        2.640157e-01
## addr_stateNJ                       -1.183487e-01
## addr_stateNM                       -8.512447e-02
## addr_stateNV                       -2.545912e-01
## addr_stateNY                       -1.260648e-01
## addr_stateOH                        4.187263e-03
## addr_stateOK                       -4.815880e-02
## addr_stateOR                        3.112655e-02
## addr_statePA                        2.136455e-03
## addr_stateRI                       -3.935880e-04
## addr_stateSC                        1.443052e-01
## addr_stateSD                       -1.651235e-03
## addr_stateTN                        2.167357e-02
## addr_stateTX                        7.209071e-02
## addr_stateUT                       -1.004820e-01
## addr_stateVA                       -1.299379e-01
## addr_stateVT                        2.788129e-01
## addr_stateWA                        4.355436e-02
## addr_stateWI                        1.073420e-01
## addr_stateWV                        1.579249e-01
## addr_stateWY                        2.204878e-01
## dti                                 8.014140e-04
## delinq_2yrs                         1.155930e-02
## inq_last_6mths                     -1.828347e-01
## open_acc                            3.228885e-03
## pub_rec                             1.617830e-01
## revol_bal                           2.759894e-06
## revol_util                         -4.939832e-03
## total_acc                          -1.947432e-03
## initial_list_statusw                5.882670e-01
## application_typeJOINT               2.047160e+00
## acc_now_delinq                      .           
## tot_coll_amt_gt0TRUE                1.816111e-01
## tot_cur_bal                         1.381133e-07
## total_rev_hi_lim                    .           
## perc_funded_amnt_inv                1.315427e+00
## termShort                           3.201315e-03

Assocation

We also ran a series of simple logistic regressions to explore factors that are associated with defaulting a loan. We recoded loan status into a dummy variable with 1 being “Bad”. The predictors are:

  • Funded amount
  • Grade
  • Employment length
  • annual income
  • Debt-to-Income Ratio (DTI)
  • The number of inquiries in past 6 months (excluding auto and mortgage inquiries)
  • The number of open account
  • Total credit revolving balance
  • Revolving line utilization rate
  • The total number of credit lines currently in the borrower’s credit file
  • Total current balance of all accounts
  • Total revolving high credit/credit limit
#recode loan status to a dummy variable, grade and employment length to continuous variables 
loan.dat <- loan.dat %>%
  mutate(status_dum = ifelse(loan_status_bin == "Good",0,1))

#prepare data subset 
varlist <- c('status_dum','funded_amnt','grade_ordinal',
             'emp_length_ordinal', 'annual_inc','dti',
             'inq_last_6mths','open_acc',
             'revol_bal','revol_util','total_acc',
             'tot_cur_bal','total_rev_hi_lim')

loan.dat_sub <- loan.dat[, varlist]

#run simple logistic regressions 

glm_fun <- function(x) {
  fit <- glm(status_dum ~ x, data = loan.dat_sub, family = "binomial")
  out <- coef(summary(fit))
  return(out)
}

simple_result <- lapply(loan.dat_sub[,2:13],function(x) glm_fun(x))

Here is the result for each logistic regression. Note that the coefficient in in log form.

simple_result_df <- matrix(nrow = 12, ncol = 4)
colnames(simple_result_df) <- c("Estimate","SE","z value","p-value")
rownames(simple_result_df) <- names(simple_result)
for (i in 1:12){
  simple_result_df[i,] <- simple_result[[i]][2,]
}

tidy(simple_result_df)

While all of estimates are statistically significant at 5% confidence level, the magnitude of effect is negligible for most of these variables. We present the odds ratio (OR) and 95% confidence intervals for three variables with large effect size: grade, employment length, and Inquiries in the last 6 months.

#this is a function of exponentiate the coefficient and get the confidence intervals of each estimate 
glm_CI <- function(x) {
  fit <- glm(status_dum ~ x, data = loan.dat_sub, family = "binomial")
  out <- exp(cbind(OR = coef(fit), confint(fit)))
  return(out)}

Grade

This result suggests that for a one unit increase in grade, which represents a downgrading of one level (i.e., from A to B), the odds of defaulting a loan increases by 49%.

glm_CI(loan.dat_sub$grade_ordinal)
##                     OR      2.5 %     97.5 %
## (Intercept) 0.02140505 0.02093053 0.02188912
## x           1.49241807 1.48341644 1.50147486

Employment length

For a one-year increase in employment length (up to 10 years), the odds of defaulting a loan decreases by 2%

glm_CI(loan.dat_sub$emp_length_ordinal)
##                     OR      2.5 %     97.5 %
## (Intercept) 0.08315479 0.08167605 0.08465545
## x           0.98340649 0.98113713 0.98568193

Inquiries in the last 6 months

For a one unit increase in the number of inquiries in past 6 months (excluding auto and mortgage inquiries), the odds of defaulting a loan increases by 29%. A recent inquiry might suggest having insufficient fund to make the upcoming repayment.

glm_CI(loan.dat_sub$inq_last_6mths)
##                     OR     2.5 %     97.5 %
## (Intercept) 0.06025912 0.0596011 0.06092263
## x           1.29462778 1.2849837 1.30431849

We note that grade is a function of many other factors (e.g. employment length, annual income). Therefore, we also ran a multivariate logistic regression to condition out grade assignment.

We see that after stratifying by grade, the effects of employment length and inquiries in the last 6 months on loan repayment remain statistically significant, although the odds ratio of inquiries in the last 6 months decreases discernably. This shows that in each grade stratum, the magnitude of the effect of inquiries in the last 6 months on loan default becomes smaller.

reg_fit <- glm(status_dum ~ grade_ordinal + emp_length_ordinal + inq_last_6mths, data=loan.dat, family = "binomial")
exp(cbind(OR = coef(reg_fit), confint(reg_fit)))
##                            OR      2.5 %     97.5 %
## (Intercept)        0.02310968 0.02248185 0.02375353
## grade_ordinal      1.45616035 1.44715249 1.46522405
## emp_length_ordinal 0.98207577 0.97977778 0.98437994
## inq_last_6mths     1.16910764 1.15996498 1.17829786